Refinements of Stout's Procedure for Assessing 
Latent Trait Unidimensionality 


Abstract 


This paper provides a detailed investigation of Stout's statistical procedure (the 
computer program DIMTEST) for testing the hypothesis that an essentially 
unidimensional latent trait model fits observed binary item response data from a 
psychological test. One finding was that DIMTEST may fail to perform as desired in the 
presence of guessing when coupled with many high-discriminating items. A revision of 
DIMTEST is proposed to overcome this limitation. Also, an automatic approach is devised 
to determine the size of the assessment subtests. Further, an adjustment is made on the 
estimated standard error of the statistic on which DIMTEST depends. These three 
refinements have led to an improved procedure that is shown in simulation studies to 
adhere closely to the nominal level of significance while achieving considerably greater 
power. Finally, DIMTEST is validated on a selection of real data sets. 

Subject terms: Unidimensionality, essential independence, essential unidimensionality, 
DIMTEST, item response theory. 

Item response theory (IRT) is presently one of the most widely used techniques in 
psychometrics and is likely to remain so in the future. Some applications of IRT include 
ability estimation, item/test bias, equating, and adaptive testing. The three assumptions 
underlying many commonly used IRT models are monotonicity, unidimensionality (d= 1), 
and local independence (LI). Monotonicity assumes that the probability of correctly 
answering an item increases as ability increases. Unidimensionality assumes that the items 
of a test measure a single ability. Local independence assumes that given any particular 
level of ability, responses to different items are independent. This paper is concerned with 
the statistical assessment of the assumption of unidimensionality. Most IRT models 
specifically require this assumption; moreover, classical test theory models implicitly 
assume that items measure the same dominant dimension. In spite of the importance of 
this assumption, it is also well known that actual data are rarely strictly unidimensional. It 
has long been argued that items are multiply determined and that, in addition to 
measuring the intended attribute, other attributes unique to individual items or common to 
relatively few items are unavoidable (Humphreys, 1981, 1985, 1986; Hambleton & 
Swaminathan, 1985; Reckase, 1979, 1985; Stout, 1987; Traub, 1983; Yen, 1985). In addition 
to the multiple item attributes that influence dimensionality, examinee characteristics such 
as differential teaching methods, the point of time during the instructional unit that the 
test is given, and so forth, also can influence the dimensionality of a set of items 
(Birenbaum & Tatsuoka, 1982; Bejar, 1983; Traub, 1983). Dimensionality is therefore a 
property of both the test and the examinee population taking the test (Reckase, 1990). 

Linear factor analysis (subjectively interpreted in the absence of a statistical 
distribution theory) has been the traditional approach for assessing the dimensionality of a 
set of items. If the results of a linear factor analysis reveal only one significant factor, then 
the test is considered unidimensional. In the case of dichotomous data, however, it is well 
known that linear factor analysis of phi correlations between items often leads to 
overestimation of the number of factors underlying the item responses (Carroll, 1945; 
Hambleton & Swaminathan, 1985, Chapter 2; Hulin, Drasgow, & Parsons, 1983, Chapter 8; 
McDonald & Ahlawat, 1974). As a corrective alternative, tetrachoric correlations can be 
used for factor analysis. When guessing is present in the responses to items, however, linear 
factor analysis of tetrachoric correlations can produce a spurious factor due to difficulty of 
test items (Hulin, Drasgow, & Parsons, 1983, Chapter 8). In addition, computation of 
tetrachoric correlations can be problematic if any one of the correct/incorrect cells of the 
two-by-two item response tables contains a zero. Matrices of simple tetrachoric 
correlations are thus often non-gramian. As a result, conventional methods of factor 
analysis by phi or tetrachoric correlations are often unsatisfactory for assessing the 
dimensionality of test items. Christoffersson (1975) and Muthen (1978) have developed 
generalized least squares methods to overcome the problems with factor analysis of 
tetrachoric correlations, but their methods are limited to 25 items at most. Moreover, they 
are computationally intensive. 

In recent years a vast body of literature has been developed for assessing the 
dimensionality of test items. Comprehensive reviews of different procedures for assessing 
dimensionality are provided by Hattie (1984, 1985), and Hulin, Drasgow, and Parsons 
(1983, Chapter 8). Some of the more recent procedures developed to assess latent trait 
dimensionality include: maximum likelihood full information factor analysis (Bock, 
Gibbons, & Muraki, 1985); the Tucker and Humphreys procedures based on local 
independence and first and second factor loadings (Roznowski, Tucker, & Humphreys, 
1991); Stout's (1987) procedure for assessing unidimensionality based on the theory of 
essential independence; modified parallel analysis, which combines latent trait methods and 
factor analysis and uses eigenvalues of the tetrachoric correlation matrix (Hulin, Drasgow, & 
Parsons, 1983, p. 255); McDonald's nonlinear factor analysis (McDonald, 1962; McDonald 
& Ahlawat, 1974; Etazadi-Amoli & McDonald, 1983); Holland and Rosenbaum's test of 
unidimensionality, monotonicity, and conditional independence (Holland, 1981; Holland & 
Rosenbaum, 1986); residual analysis, determined by model-data fit (Hambleton & 
Swaminathan, 1985, Chapter 8); and Bejar's procedure based on three-parameter logistic 
item parameter estimates (Bejar, 1980). Some of these methods are reviewed in Hambleton 
and Rovinelli (1986), Mislevy (1986), and Zwick (1987a). 

Although these different approaches offer promise for assessing the dimensionality of 
binary data, researchers in the field have not reached a consensus on one satisfactory 
method (Berger & Knol, 1990; Hambleton & Rovinelli, 1986; Hattie, 1984; Zwick, 1987b). 
Primarily, this is due to the fact that there is substantial confusion in the literature 
concerning the definition of unidimensionality. Additionally, many existing methods for 
assessing dimensionality are only loosely connected to the various definitions in the 
literature (Hambleton & Rovinelli, 1986). 

This article is concerned with Stout's procedure for assessing unidimensionality 
(DIMTEST). Stout (1987) has developed a nonparametric statistical procedure based on 
the large sample distribution theory for assessing latent trait dimensionality and has 
argued the validity of this procedure based upon simulation studies involving a variety of 
achievement tests. DIMTEST has been shown to discriminate well between one- and 
two-dimensional tests, maintaining good adherence to a specified level of significance when 
d= 1 and maintaining good power when d= 2, even when the correlation between the 
abilities is as high as .7. 

The present study provides a detailed investigation of certain performance 
characteristics and the consequent major refinements of DIMTEST for assessing latent 
trait dimensionality. DIMTEST was found to perform undesirably in certain cases where 
the test contained many highly discriminating items with guessing present. A correction is 
proposed to overcome this limitation. In addition, an automatic approach is devised for 
determining M, the size of the assessment subtests; a better control of $\alpha$, the specified level 
of significance, is achieved by adjusting the estimated standard error of Stout's statistic T. 
These refinements have led to an improved test procedure that is easier to use and has been 
shown in simulation studies to adhere closely to the nominal level of significance while 
achieving considerably greater power. Finally, the procedure is applied to a selection of real 
data sets. 


Stout's Procedure for Assessing Unidimensionality 

As stated in the beginning of this paper, items are multiply determined, and thus 
the number of dominant abilities should be assessed in testing for dimensionality. Stout 
first informally (1987) and then formally (1990) provided a definition of the number of 
dominant dimensions known as essential dimensionality, which is derived from the theory 
of essential independence. The statistical procedure for assessing essential unidimensionality 
is consistent with the definition of essential dimensionality. To help the reader 
evaluate this claim and understand the refinements made 
to DIMTEST, Stout's definition of essential dimensionality will be followed by a brief 
summary of the statistical procedure. The reader is advised, however, that use of 
DIMTEST does not require acceptance of Stout's notion of essential dimensionality, and, 
in fact, DIMTEST can also be viewed as a technique to detect sizable lack of fit of a locally 
independent unidimensional latent trait model. 

Let $U_i$ denote the $i$-th item response and $\mathbf{U}_N = (U_1, U_2, \ldots, U_N)$ denote the test 
response vector for an $N$-item test. Observed item and test values will be denoted by $u_i$ 
and $\mathbf{u}_N = (u_1, u_2, \ldots, u_N)$, respectively. Let $U_i = 1$ denote a correct response and $U_i = 0$ 
denote an incorrect response to item $i$ for a randomly chosen examinee. The latent random 
vector is denoted by $\Theta$ and the particular values it takes are denoted by $\theta$. Let $P_i(\theta)$ denote 
the probability that a randomly chosen examinee with ability $\theta$ will get the $i$-th item 
correct. It is assumed that all item response functions $P_i(\theta)$ are monotone. Let $\mathbf{U} = \{U_i, i \geq 1\}$ 
denote the item pool consisting of $\mathbf{U}_N$ as its first $N$ items. The item pool is 
conceptualized as a result of continuing the test construction process in the same manner 
beyond the construction of the $N$ items that make up the actual test $\mathbf{U}_N$ being studied. 

One advantage of using the only partially observed $\mathbf{U}$ instead of the actually observed 
$\mathbf{U}_N$ to model the test is that a totally rigorous definition of the number of dominant dimensions 
can be given. These ideas are carefully and formally developed in Stout (1990) and 
constitute a large sample approach to test modeling. 


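Definition 1 (Stout, 1990). The item pool $\mathbf{U}$ is said to be essentially independent (EI) 
with respect to the latent variable $\Theta$ if $\mathbf{U}$ satisfies 

$$D_N(\theta) \equiv \frac{\sum_{1 \leq i < j \leq N} \left|\mathrm{Cov}(U_i, U_j \mid \Theta = \theta)\right|}{\binom{N}{2}} \to 0 \quad \text{as } N \to \infty \qquad (1)$$

for every $\theta$. 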


The distinction between local independence and essential independence is that local 
independence requires $\mathrm{Cov}(U_i, U_j \mid \Theta = \theta) = 0$ for all $\theta$, whereas essential independence 
requires the average value of $|\mathrm{Cov}(U_i, U_j \mid \Theta = \theta)|$ over all item pairs to be small in 
magnitude for all $\theta$ as the test length increases. Hence, essential independence is a weaker 
assumption than local independence. 
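
The contrast can be made concrete with a small sketch. It computes the average absolute covariance over all item pairs for one group of examinees who are assumed to share roughly the same ability, which is an empirical stand-in for the quantity in (1). The function name and the tiny data sets are illustrative assumptions, not part of DIMTEST.

```python
from itertools import combinations


def mean_abs_conditional_cov(responses):
    """Average |Cov(U_i, U_j)| over all item pairs within one ability group.

    `responses` is a list of 0/1 response vectors, one per examinee, for
    examinees assumed (hypothetically) to have about the same ability.
    """
    n_examinees = len(responses)
    n_items = len(responses[0])
    means = [sum(r[i] for r in responses) / n_examinees for i in range(n_items)]
    total = 0.0
    n_pairs = 0
    for i, j in combinations(range(n_items), 2):
        cov = sum((r[i] - means[i]) * (r[j] - means[j]) for r in responses) / n_examinees
        total += abs(cov)
        n_pairs += 1
    return total / n_pairs


# Two items answered independently within the group: covariance is zero.
independent = [[1, 1], [1, 0], [0, 1], [0, 0]]
# Three items that move together or in opposition: covariances are not zero.
dependent = [[1, 1, 0], [0, 0, 1], [1, 1, 0], [0, 0, 1]]
print(mean_abs_conditional_cov(independent))
print(mean_abs_conditional_cov(dependent))
```

Local independence would require every pairwise term to vanish, whereas essential independence only asks that this average shrink as items are added.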
Definition 2 (Stout, 1990). The essential dimensionality ($d_E$) of an item pool $\mathbf{U}$ is 
the minimal dimensionality (number of elements in $\Theta$) necessary to satisfy the assumption 
of essential independence. When $d_E = 1$, essential unidimensionality is said to hold. 

The reader should note that $d_E = 1$ means that $\mathbf{U}$ has an IRT model for which 
essential independence holds for a unidimensional latent trait $\Theta$. The ordinary definition of 
IRT dimensionality is the same as Definition 2, but essential independence is replaced by 
local independence and $d_E$ is replaced by $d$. Stout argues that the assumptions concerning 
local independence and the resulting ordinary IRT definition of dimensionality should often 
be replaced by the respective weaker assumptions concerning essential independence and 
essential dimensionality. Junker (1988, 1991) has proved results concerning essential 
independence and, in particular, has derived statistical consistency results for maximum 
likelihood estimates of ability under the assumption of essential unidimensionality. 

It can now be clearly stated what assessing the hypothesis of essential 
unidimensionality means: among all the essentially independent monotone IRT models for 
$\mathbf{U}$, does there exist a unidimensional one? To answer this question, we assume both 
monotonicity and essential independence and assess the lack of fit of unidimensionality. 
This approach is similar to most other procedures for assessing data dimensionality, with 
the exception that essential rather than local independence is assumed. 

The statistical procedure for testing the null hypothesis of essential 
unidimensionality will be briefly described here. For further details see Stout (1987, Sec. 4). 
The N test items are split into two assessment subtests of length M each, called the 
Assessment 1 subtest (AT1) and the Assessment 2 subtest (AT2), and a longer subtest 
called the partitioning subtest (PT) of length n (= N - 2M). The M items for subtest AT1 
are selected to have the same dominant trait. This splitting can be done using either expert 
opinion or exploratory factor analysis. Whatever method is used to select items of AT1, the 
goal is to select a small subset of items (up to one-fourth of the total test length seems a 
good convention) that all measure the same dominant trait and, at the same time, are as 
dimensionally different as possible from the PT items. Once items for AT1 are selected, a 
second set of M items is selected for AT2 from the remaining items so that AT2 items have 
a difficulty distribution similar to AT1 items (Step 6, Stout, 1987). The remaining n (= 
N - 2M) items then become the partitioning subtest PT. 
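
The three-way split can be sketched as follows. The matching rule used here (for each AT1 item, take the unused item with the closest proportion correct) is an assumption made for illustration; the actual rule is Step 6 of Stout (1987), which is not reproduced in this section. The function name `select_at2` is hypothetical.

```python
def select_at2(p_correct, at1):
    """Split the non-AT1 items into AT2 and PT by greedy difficulty matching.

    p_correct: proportion correct for each item (a simple difficulty index).
    at1: indices of the items already chosen for AT1.
    Returns (at2, pt) as lists of item indices.
    NOTE: nearest-neighbour matching is an illustrative assumption, not the
    documented DIMTEST rule.
    """
    at1_set = set(at1)
    remaining = [i for i in range(len(p_correct)) if i not in at1_set]
    at2 = []
    for item in at1:
        # Pick the unused item whose difficulty is closest to this AT1 item.
        match = min(remaining, key=lambda i: abs(p_correct[i] - p_correct[item]))
        at2.append(match)
        remaining.remove(match)
    return at2, remaining


p = [0.90, 0.20, 0.88, 0.50, 0.25, 0.60]
at2, pt = select_at2(p, at1=[0, 1])
print(at2, pt)
```

Whatever matching rule is used, the point is that AT2 mirrors the difficulty distribution of AT1 and every item left over forms PT.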

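Each examinee is assigned to one of $K$ subgroups according to his/her score on PT. 
After eliminating subgroups with too few examinees ($J_{\min} = 20$ recommended), within each 
subgroup $k$, two variance estimates, the usual variance estimate ($\hat{\sigma}_k^2$) and the 
"unidimensional" variance estimate ($\hat{\sigma}_{U,k}^2$), are computed using items of AT1: 

$$\hat{\sigma}_k^2 = \frac{1}{J_k}\sum_{j=1}^{J_k}\left(Y_j^{(k)} - \bar{Y}^{(k)}\right)^2, \qquad \hat{\sigma}_{U,k}^2 = \frac{1}{M^2}\sum_{i=1}^{M}\hat{p}_i^{(k)}\left(1 - \hat{p}_i^{(k)}\right),$$

where 

$$Y_j^{(k)} = \frac{1}{M}\sum_{i=1}^{M} U_{ijk}, \qquad \bar{Y}^{(k)} = \frac{1}{J_k}\sum_{j=1}^{J_k} Y_j^{(k)}, \qquad \hat{p}_i^{(k)} = \frac{1}{J_k}\sum_{j=1}^{J_k} U_{ijk},$$

with $U_{ijk}$ the response to item $i$ by examinee $j$ in subgroup $k$ and $J_k$ the number of examinees in subgroup $k$. 

The difference in these variance estimates is then normalized by an appropriate 
normalizing constant $S_k$ and summed over subgroups to arrive at the statistic 

$$T_L = \frac{1}{\sqrt{K}}\sum_{k=1}^{K}\frac{\hat{\sigma}_k^2 - \hat{\sigma}_{U,k}^2}{S_k}. \qquad (2)$$

[Equation (3), which defines $S_k$, is not legible in the source.] 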
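
The grouping step and the two variance estimates can be sketched as below. This assumes, as in Stout (1987), that the usual estimate is the variance of the proportion-correct scores on the assessment subtest and that the "unidimensional" estimate is the sum of the item variances p(1 - p) divided by M squared. Function and argument names are illustrative assumptions.

```python
from collections import defaultdict


def group_by_pt_score(pt_scores, at_responses, j_min=20):
    """Group assessment-subtest response vectors by PT score.

    Subgroups with fewer than j_min examinees are dropped.
    """
    groups = defaultdict(list)
    for score, responses in zip(pt_scores, at_responses):
        groups[score].append(responses)
    return {s: g for s, g in groups.items() if len(g) >= j_min}


def variance_estimates(subgroup):
    """Return (usual, unidimensional) variance estimates for one subgroup.

    subgroup: list of 0/1 response vectors on the M assessment items.
    """
    j_k = len(subgroup)
    m = len(subgroup[0])
    scores = [sum(r) / m for r in subgroup]              # proportion-correct scores
    mean_score = sum(scores) / j_k
    usual = sum((y - mean_score) ** 2 for y in scores) / j_k
    p_hat = [sum(r[i] for r in subgroup) / j_k for i in range(m)]
    unidim = sum(p * (1 - p) for p in p_hat) / m ** 2    # ignores item covariances
    return usual, unidim


# Independent items: the two estimates agree.
print(variance_estimates([[1, 1], [0, 0], [1, 0], [0, 1]]))
# Perfectly dependent items: the usual estimate exceeds the unidimensional one.
print(variance_estimates([[1, 1], [0, 0], [1, 1], [0, 0]]))
```

Under these definitions the difference between the two estimates equals twice the sum of the within-group item covariances divided by M squared, so it is near zero exactly when the items are nearly uncorrelated within the subgroup.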

Similarly, using items of AT2, the two variance estimates $\hat{\sigma}_k^2$ and $\hat{\sigma}_{U,k}^2$ and the 
standard error of estimate $S_k$ are computed and normalized within each subgroup to arrive 
at the statistic $T_B$ using formula (2). The statistic $T$ to assess departure from essential 
unidimensionality is given by 

$$T = (T_L - T_B)/\sqrt{2}. \qquad (4)$$

The null hypothesis of $d_E = 1$ is rejected if $T > Z_\alpha$, where $Z_\alpha$ is the upper $100(1-\alpha)$ 
percentile of the standard normal distribution and $\alpha$ is the desired level of significance. 
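
A minimal sketch of the decision rule follows. It assumes that formula (2) sums the subgroup differences, each divided by its standard error, and scales the sum by one over the square root of K; the standard errors are taken as inputs because their formula is not reproduced here. Function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def subtest_statistic(differences, std_errors):
    """Sum of normalized variance differences, scaled by 1/sqrt(K).

    differences: usual minus unidimensional variance estimate, per subgroup.
    std_errors: caller-supplied standard error for each subgroup.
    """
    k = len(differences)
    return sum(d / s for d, s in zip(differences, std_errors)) / sqrt(k)


def dimtest_decision(t_l, t_b, alpha=0.05):
    """Return (T, reject) where T = (T_L - T_B) / sqrt(2), as in formula (4)."""
    t = (t_l - t_b) / sqrt(2)
    critical = NormalDist().inv_cdf(1 - alpha)   # upper 100(1 - alpha) percentile
    return t, t > critical


t, reject = dimtest_decision(t_l=10.0, t_b=7.0)
print(round(t, 3), reject)
```

With the two inputs set to 10 and 7, the difference statistic is about 2.12, which exceeds the one-sided .05 critical value of about 1.645.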


Correction for Bias in the Statistic by Introduction of $T_B$ 

Consider the statistical bias that would result if $T_L$ rather than $T$ were the statistic 
used to assess essential dimensionality. The above description shows that Stout's test is 
based on two variance estimates: the usual variance estimate $\hat{\sigma}_k^2$ and the unidimensional 
variance estimate $\hat{\sigma}_{U,k}^2$. If the items of the test measure one dominant trait, then the two 
subtests AT1 and PT would contain essentially unidimensional items representing the same 
dominant trait. When the test is both long and essentially unidimensional, 
examinees within each subgroup can be assumed to be of approximately equal ability. 
Consequently, it can be shown that the differences in the variance estimates $\hat{\sigma}_k^2 - \hat{\sigma}_{U,k}^2$, 
computed using items in AT1, would be "small"; thus, using $T_L$, the test will be assessed 
as essentially unidimensional. By contrast, if the test is long and essentially 
multidimensional, the trait measured by items of AT1 would be different from the trait(s) 
measured by the rest of the test, and the AT1 differences $\hat{\sigma}_k^2 - \hat{\sigma}_{U,k}^2$ would not be small (see 
Stout (1987) for the heuristics explaining why this holds), and $T_L$ would lead to the conclusion 
that the test is essentially multidimensional. 

In the case of a relatively short essentially unidimensional test, however, examinees 
within each subgroup are not likely to be approximately equal on the dominant trait 
measured by the test, thereby causing the differences $\hat{\sigma}_k^2 - \hat{\sigma}_{U,k}^2$ to be large. This improperly 
inflates the value of the statistic $T_L$ and results in statistical bias. This bias is amplified if 
items of AT1 are homogeneous with respect to item difficulty, which often occurs when 
AT1 is selected by factor analysis. To correct for this pre-asymptotic bias in $T_L$, AT2 is 
constructed so that items of AT1 and AT2 are closely matched in their item difficulty 
distribution. It has been observed that subtests AT1 and AT2 are both subject to similar 
amounts of pre-asymptotic bias, but because AT2 is chosen to be similar to AT1 in 
difficulty only, $T_B$ formed from AT2 will not be made larger by the presence of 
multidimensionality. Thus, as statistical experimental design ideas suggest, the bias is 
cancelled by forming the difference statistic $T$ (Step 6, Stout, 1987). 

Avoiding Bias due to Guessing and High Discrimination of Items 

Test items usually differ with respect to their various measurement properties. 
There may be difficult items, easy items, high discrimination items, low discrimination 
items, and so on. The 80-item SAT-Verbal vocabulary test analyzed by Lord (1968) is no 
exception. Item parameter estimates for this test were obtained by LOGIST. DIMTEST 
with a specified level of significance $\alpha = .05$ was applied to a three-parameter logistic, 
unidimensional, simulation model on various random subtests of 50 items selected from this 
test. For 100 replications of DIMTEST on the simulated test data, 5 rejections of the 
hypothesis of $d_E = 1$ were observed, strongly confirming the unidimensional nature of the 
simulated data. Items of the SAT-Verbal test were then divided into two sets. One set 
consisted of items having discrimination parameters greater than 1.0 
(high-discriminations); the other set consisted of items with discrimination parameters less 
than or equal to 1.0 (low-discriminations). DIMTEST was applied separately to each 
subtest and the results were markedly different. Note from the classical test theory 
perspective that the first test has high reliability and the second test low reliability. 
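
For readers who want to reproduce this kind of setup, the sketch below generates unidimensional three-parameter logistic responses and splits an item set at a discrimination of 1.0. The scaling constant 1.7, the standard normal ability distribution, and the item parameters in the example are assumptions chosen for illustration; they are not the LOGIST estimates used in the study.

```python
import math
import random


def p_3pl(theta, a, b, c, scale=1.7):
    """3PL item response function: c + (1 - c) / (1 + exp(-scale * a * (theta - b)))."""
    return c + (1 - c) / (1 + math.exp(-scale * a * (theta - b)))


def simulate_responses(items, n_examinees, seed=0):
    """Simulate 0/1 responses; items is a list of (a, b, c) tuples.

    Abilities are drawn from a standard normal distribution (an assumption).
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n_examinees):
        theta = rng.gauss(0.0, 1.0)
        data.append([1 if rng.random() < p_3pl(theta, a, b, c) else 0
                     for a, b, c in items])
    return data


def split_by_discrimination(items, cutoff=1.0):
    """Return (high, low): items with a > cutoff and items with a <= cutoff."""
    high = [it for it in items if it[0] > cutoff]
    low = [it for it in items if it[0] <= cutoff]
    return high, low


items = [(1.2, 0.0, 0.20), (0.8, 1.0, 0.25), (1.5, -0.5, 0.15)]  # hypothetical
high, low = split_by_discrimination(items)
data = simulate_responses(high, n_examinees=750, seed=1)
print(len(high), len(low), len(data))
```

Setting every c to zero in such a generator gives the corresponding two-parameter model.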


Table 1 


Table 1 displays the performance of the procedure, for both subtests, administered 
to 750, 1000, and 2000 examinees. In these simulations, seven items were selected in each of 
the assessment subtests based on factor analysis with $J_{\min} = 20$. The reported values in 
Table 1 are the number of rejections out of the 100 replications of DIMTEST. 

The number of rejections for the test with low-discriminations is what is to be 
expected on a unidimensional test. However, the rejection rate for the test with 
high-discriminations far exceeds the nominal level of 5/100. Furthermore, as the number of 
examinees increases, the rejection rate also increases. 

This finding was confirmed in another unidimensional simulation, which used the 
ASVAB general science test as its basis. Item parameter estimates for this test were 
obtained by Mislevy and Bock (1984). In this simulation a rejection rate of 13/100 was 
observed with $\alpha = .05$. Further investigation showed that this elevated rejection rate was 
caused by a preponderance of difficult, highly discriminating items. Thus, there is evidence 
to show that if many items of a test are both highly discriminating and difficult with 
guessing present, the observed type-I error rate may be unacceptably inflated. 

In an attempt to determine the cause(s) for excess bias, Monte Carlo simulations 
were investigated extensively with tests of high-discriminating items. Recalling that items 
for AT1 were chosen according to the magnitude of their loadings on the second extracted 
factor (Step 1, Stout, 1987), it was found that in the case of high-discriminations with 
guessing present (with $d_E = 1$), the second factor was a very pronounced difficulty factor 
even though tetrachoric correlations were used. One of the characteristics of the difficulty 
factor is that very easy and very difficult items have high loadings of the opposite sign. In 
the case of high-discriminations, for unknown reasons, but likely due to the presence of 
guessing, most often very easy items tended to have larger factor loadings in magnitude on 
the second factor than the corresponding collection of very difficult items. Consequently, 
the easiest items tended to be selected for AT1. To control for statistical bias, DIMTEST 
then selects the easiest remaining items for AT2. Therefore, PT is left with mostly difficult 
items. Because examinees are grouped according to their scores on PT, which 
consists mostly of difficult items in this case, the partitioning subtest (PT) tends to 
misclassify low-ability examinees. This misclassification is made worse if guessing is 
allowed. Thus, examinee abilities within each assigned subgroup may vary considerably, 
leading to a serious violation of the fundamental assumption of essential independence 
within subgroups. This assumption is critical for the statistic $T$ to adhere closely to the 
nominal level of significance. As a result, the values of the statistic $T_L$ (computed from 
AT1) averaged around 10, and the values of $T_B$ (computed from AT2) averaged around 7. 
Thus, the values of $T = (T_L - T_B)/\sqrt{2}$ were so large that the hypothesis of $d_E = 1$ was 
often rejected. Although $T_B$ is supposed to compensate for the bias in $T_L$, the bias in $T_L$ 
was so large that compensation was ineffective. 

It is interesting to note that there are two reasons why the SAT subtest with 
low-discriminations failed to exhibit statistical bias. First, low-discriminations enhance 
the ability of AT2 to compensate (in a statistical, experimental design balancing sense) 
for the bias contributed by items of AT1. Second, the SAT subtest with 
low-discriminations has a wider distribution of item difficulty, thereby tending to reduce 
the misclassification of examinees in the formation of subgroups. 

Another unidimensional simulation study was conducted with the same 
high-discriminations SAT items, but with all c-parameters set to zero, creating a 
high-discriminations 2PL model. There was 1 rejection out of 100 trials. Therefore, the 
presence of guessing coupled with high-discriminations seemed to have caused the inflated 
rejection rate. This is true because, without guessing in the model, a highly pronounced 
difficulty factor is unlikely to appear in the tetrachoric factor analysis and, in fact, did not 
appear in high-discriminations 2PL simulations. Moreover, eliminating guessing reduces 
the problem of misclassification of low-ability examinees. 

Based on the above findings, it was conjectured that when guessing and highly 
discriminating items are present, the assignment of examinees to subgroups could be done 
more effectively using PT scores that were based on items that included easy items. This 
was achieved in the following way. First, items of AT1 are checked statistically, using the 
Wilcoxon rank sum test, to test if the items of AT1 are too easy as a group. If the 
Wilcoxon rank sum test rejects, the procedure is to replace these items with items of highest 
loadings of the opposite sign so that they are still dimensionally homogeneous. If the 
Wilcoxon rank sum test does not reject, items of AT1 are retained. Algorithm 1 in the 
Appendix describes this procedure in detail. Items of AT2 are selected, as before, so that 
items in AT1 and AT2 have approximately the same difficulty distribution. 
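
The check on AT1 can be sketched with a one-sided Wilcoxon rank sum test on proportion-correct values. This is only an illustration of the idea: the use of proportion correct as the difficulty index, the normal approximation without a tie correction, and the .05 level are assumptions, and the appendix algorithm referred to above is not reproduced here.

```python
from math import sqrt
from statistics import NormalDist


def midranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for q in range(pos, end + 1):
            ranks[order[q]] = avg
        pos = end + 1
    return ranks


def at1_too_easy(p_at1, p_rest, alpha=0.05):
    """One-sided rank sum test: are AT1 items easier than the remaining items?

    Uses the large-sample normal approximation (no tie correction), so it is
    only a rough guide for very small item sets.
    """
    n1, n2 = len(p_at1), len(p_rest)
    ranks = midranks(list(p_at1) + list(p_rest))
    w = sum(ranks[:n1])                                  # rank sum of AT1 items
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    return z > NormalDist().inv_cdf(1 - alpha)


easy_at1 = [0.90, 0.92, 0.95, 0.97, 0.99]
rest = [0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70, 0.75]
print(at1_too_easy(easy_at1, rest))
```

When the function returns True, the replacement step described above would be triggered; otherwise AT1 is kept as selected.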








Automating the Size M of Assessment Subtests 

As described previously, DIMTEST splits N items of the test into three subtests: 
AT1 and AT2 of length M each, and PT of length n (= N - 2M). In all the simulation 
studies presented in Stout (1987), the size of the assessment subtests M was specified by 
the user a priori. For example, for a 30-item test, 5 or 7 items were used in each of the 
assessment subtests; for a 50-item test, 8 or 12 items were used. By contrast, our aim has 
been to develop an algorithm that automatically determines the size of assessment subtests 
according to the magnitude of item loadings on the second extracted factor. For most 
applications this would seem preferable to the selection of M, a priori, especially by a 
novice user. 

According to Stout's large sample theory for DIMTEST, M should be small 
compared to N. Extensive Monte Carlo simulations showed that a minimum of four items 
was needed in each of the assessment subtests in order to have reliable variance estimates 
(Nandakumar, 1987; Stout, 1984, p. 31). To determine the maximum size of M (Max M) 
that will yield desirable results, three different sizes of M were tried: Max M = 1/5 of the 
test length, Max M = 1/4 of the test length, and Max M = 1/3 of the test length. 

Similarly, to determine the minimum size of factor loading that should be used for 
assigning an item to ATI, three different "starting" values (Start) of factor loadings were 
tried: Start = .25, Start = .20, and Start = .15. An experimental design was set up for 
conducting simulations with all three sizes of Max M and with all three values of Start. For 
each combination of Max M and Start, both type-I error and power were observed over 
repeated trials of DIMTEST with tests of different types. To illustrate, let Max M = 1/5 
and Start = .25. Based on the loadings of the second factor, items with absolute loading 
greater than .25 are to be considered for AT1 selection. The average item loading is 
computed for items with positive loadings and for items with negative loadings. The set 
with the highest average loading, in absolute value, is selected for AT1 and the size of this 
set determines M. If the minimum required number of items is not obtained with either 
positive or negative loadings, the start value is decreased by .05 until the minimum number 
of items is found. Similarly, if in the selected set more than 1/5 of the items have absolute 
loadings greater than .25, only 1/5 of the items with the highest loadings are included. 
Algorithm 2 in the Appendix describes this procedure in detail. The observation of type-I 
error and power for different values of Max M and Start revealed that Max M = 1/4 and 
Start = .15 yielded the most desirable results. These values were then used for selection of 
items in simulations reported in Tables 3 through 7 of this paper. Other combinations 
of Max M and Start yielded either an observed type-I error rate that is too high or an 
observed power level that is too low. 
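
The selection rule described above can be sketched as follows. This is one reading of the prose, not the appendix algorithm: it assumes the cap is Max M times the test length, that a side qualifies only if it has at least four items above the threshold, and that the threshold stops decreasing at zero. The function name and defaults are illustrative.

```python
def select_at1(loadings, start=0.15, max_frac=0.25, min_items=4, step=0.05):
    """Pick AT1 items from second-factor loadings.

    loadings: second-factor loading of each item.
    Returns the chosen item indices, largest absolute loading first.
    """
    n = len(loadings)
    max_m = max(min_items, int(n * max_frac))
    threshold = start
    while True:
        pos = [i for i, v in enumerate(loadings) if v > threshold]
        neg = [i for i, v in enumerate(loadings) if v < -threshold]
        candidates = [s for s in (pos, neg) if len(s) >= min_items]
        if candidates or threshold <= 0:
            break
        threshold = max(0.0, threshold - step)   # relax the start value
    if not candidates:
        raise ValueError("not enough items load on either side of the factor")
    # Choose the side with the larger average absolute loading.
    chosen = max(candidates,
                 key=lambda s: sum(abs(loadings[i]) for i in s) / len(s))
    # Keep at most max_m items, those with the largest absolute loadings.
    chosen = sorted(chosen, key=lambda i: abs(loadings[i]), reverse=True)
    return chosen[:max_m]


loadings = [0.5, 0.4, 0.3, 0.35, 0.45, 0.2, -0.2, -0.18, -0.16, -0.3] + [0.05] * 10
at1 = select_at1(loadings)
print(at1, len(at1))
```

In this example the positive side has the larger average loading and six qualifying items, so the five largest are kept and M is 5 for a 20-item test.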

Standard Error Estimation in Stout's Statistic 

The general approach used in the development of Stout's statistic first derived an 
asymptotically valid test statistic and then made adjustments to optimize the 
pre-asymptotic behavior of the statistic, guided by Monte Carlo simulations. 

Stout's statistic to test the hypothesis of essential unidimensionality was built by 
combining information measuring the strength of evidence of the nonunidimensionality 
contributed by each of the $k = 1, \ldots, K$ subgroups of examinees. That is, the goal was to 
construct a statistic using the quantities 

$$X_k = \hat{\sigma}_k^2 - \hat{\sigma}_{U,k}^2 \qquad (5)$$

from the $K$ subgroups of examinees. Each $X_k$ measures nonunidimensionality in the sense that 
$X_k \approx 0$ when $d_E = 1$, and $X_k > 0$ on average when $d_E > 1$. The most obvious approach is to 
add up the contributions of $X_k$ and then normalize this sum by an appropriate standard 
error of estimate. When unidimensionality holds, Stout (1987) found the estimated 
asymptotic variance of $X_k$ to be 




[Equation (6), the expression for the estimated asymptotic variance $S_k'^2$, is not legible in the source.] 

leading to the statistic 

$$T' = (T_L' - T_B')/\sqrt{2},$$

where 

$$T_L' = \frac{\sum_{k=1}^{K} X_k}{\left(\sum_{k=1}^{K} S_k'^2\right)^{1/2}}. \qquad (7)$$

Result 6.1 of Stout (1987, page 599) suggests that under regularity conditions when $d_E = 1$, 
$T_L'$ and $T'$ should be asymptotically $N(0,1)$ as the number of examinees and the number 
of items both approach $\infty$. Moreover, Result 6.4 of Stout (1987, page 601) states that both 
$T_L'$ and $T'$ should have asymptotic power one when $d_E > 1$. 

Simulation studies conducted prior to the study reported in Stout (1987) showed 
that, for test lengths and examinee population sizes typically encountered in practice, the 
statistical test $T'$ falsely rejected the hypothesis of unidimensionality more frequently 
than the nominal error rate. Two modifications for constructing $T$ of (4) were then 
considered: (a) enlarge $S_k'$ to $S_k$ of (3) so that the values of $T$ on the average would be 
smaller, thereby reducing the rate of occurrence of type-I error to close to or even below 
the nominal level, and (b) normalize each $X_k$ by its estimated standard error and then sum 
(instead of first summing and then normalizing as in (7)). This modified statistic $T$ was 
used in simulation studies reported in Stout (1987). However, the observed average type-I 
error (.023) in Stout (1987, Table 2) was well below the nominal level ($\alpha = .05$). 

Because $S_k'$ yielded too large an observed type-I error and $S_k$ yielded too small an 
observed type-I error, the following adjustment to the estimated standard error was 
considered in addition to $S_k$ in the present study. 

S k = [^4,AT 4) + *4,*/^] ! J k W 

It can be seen that 


Furthermore, a basic question in constructing the statistic $T$ was how to combine 
the building blocks $X_k$ of (5) into a single appropriately normalized statistic for testing for 
unidimensionality. That is, restricting attention to linear scoring, the search was for an 
appropriate choice of weights $\{a_k, 1 \leq k \leq K\}$ to form $\sum_{k=1}^{K} a_k X_k$. Three different 
weighting procedures were considered. Six new statistics, $T_1$ through $T_6$, as described 
below, were derived as a result of using different weights and standard errors of estimates. 
The objective was to find an improved statistic with an increased observed type-I error to 
approximate the nominal level while maintaining or even improving the power. 

An estimator or test statistic is useful provided it centers on the appropriate 
parameter and has a small standard error. It can be shown that $\mathrm{Var}\left(\sum_{k=1}^{K} a_k X_k\right)$ is 
minimized, subject to the constraint $\sum_{k=1}^{K} a_k = 1$, by setting $a_k = 
[1/\mathrm{var}(X_k)] / \sum_{j=1}^{K} [1/\mathrm{var}(X_j)]$. Based on this argument, the statistic $T_1 = 
(T_{L,1} - T_{B,1})/\sqrt{2}$ was constructed, where 

$$T_{L,1} = \frac{\sum_{k=1}^{K} X_k / S_k^2}{\left(\sum_{k=1}^{K} 1/S_k^2\right)^{1/2}}.$$
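
The minimum-variance argument can be checked numerically with a short sketch. It assumes the subgroup quantities are independent, so that the variance of a weighted sum is the sum of squared weights times the individual variances; the function names and the numbers are illustrative.

```python
from math import sqrt


def min_variance_weights(variances):
    """Weights proportional to 1/variance, normalized to sum to one."""
    inverse = [1.0 / v for v in variances]
    total = sum(inverse)
    return [w / total for w in inverse]


def weighted_statistic(x, variances):
    """Weighted sum of x divided by its standard error (independence assumed)."""
    a = min_variance_weights(variances)
    combined = sum(ai * xi for ai, xi in zip(a, x))
    var_combined = sum(ai ** 2 * vi for ai, vi in zip(a, variances))
    return combined / sqrt(var_combined)


variances = [1.0, 4.0]
weights = min_variance_weights(variances)
optimal_var = sum(w ** 2 * v for w, v in zip(weights, variances))
equal_var = sum(0.5 ** 2 * v for v in variances)
print(weights, optimal_var, equal_var)
```

With variances 1 and 4 the weights are .8 and .2, giving a combined variance of .8, compared with 1.25 under equal weights.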


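The statistic $T_2$ was constructed similar to the statistic $T$ of (4) but with the adjusted 
standard error of (8), written here as $\tilde{S}_k$, as the estimated standard error. That is, $T_{L,2}$ is given by 

$$T_{L,2} = \frac{1}{\sqrt{K}} \sum_{k=1}^{K} \frac{X_k}{\tilde{S}_k}.$$

The statistic $T_3$ was constructed with weights as in $T_1$ but with $\tilde{S}_k$ as the estimated 
standard error. That is, $T_{L,3}$ is given by 

$$T_{L,3} = \frac{\sum_{k=1}^{K} X_k / \tilde{S}_k^2}{\left(\sum_{k=1}^{K} 1/\tilde{S}_k^2\right)^{1/2}}.$$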


Based upon the naive, intuitive idea that those subgroups with more examinees in 
them should receive more weight in the constructed statistic, two more definitions T^ and 
Tg were proposed,where 


;,4 - 2jy*- 


l(W A ’ 


respectively. 


Lastly, based upon the Central Limit Theorem and contrasted with the statistic T of
(3), the statistic $T_6$ was derived, where
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h, e=< 14 > 

In summary, Stout's (1987) recommended statistics T as well as statistics and 
Tg use as the estimated standard error, and the statistics, and use as the 
estimated standard error. The statistics T^ and Tg use weights according to the principle 
of minimum variance with S^ and 5^ as the standard errors of estimates, respectively. The 
statistics and Tg use weights (the number of examinees in each cell) and 
respectively, with as the standard error of estimate. And finally the statistic Tg is based 

Q 

on the usual form of the Central Limit Theorem. 

We decided that the statistics $T_i = (T_{L,i} - T_{B,i})/\sqrt{2}$, $i = 1, \ldots, 6$, with different weights
and standard errors should provide an ample choice of statistics for a simulation study to
assess whether an improved statistic can be obtained that would be better than using T.

Monte Carlo Simulation Studies 

A Monte Carlo simulation study was undertaken to study the performance of
DIMTEST after three refinements: correction for high-discrimination bias using the Wilcoxon
rank sum test, automation of the size M of the assessment subtests, and correction of the
standard error of estimate. In all simulations, $J_{\min} = 20$ was adopted. The simulation
study was designed to be similar to Stout's (1987) study in order to compare the
performance of the statistic before and after the proposed corrections.

Two issues were of particular importance in the study: (a) how well the nominal
level of significance specified by the user ($\alpha = .05$) is approximated by the observed level of
significance when $d_E = 1$,^10 and (b) how large the power of the statistical test is in
various $d_E = 2$ settings.

The Preliminary Standard Error Study

In a preliminary pilot simulation study, the performance of six different statistics,
$T_1$ through $T_6$, was studied and compared with T (after implementing corrections for
high discrimination with guessing and automated M) with respect to type-I error and
power in various test settings. The results revealed that the statistic $T_2$ yielded a higher
observed type-I error, closer to the nominal level, and a higher power than T. One of the
other statistics yielded an unacceptably large type-I error, and the remaining four differed little
in performance from T and thus would offer no advantage. Therefore, the statistic $T_2$ was
used in the simulations described below, and the results were compared with the simulations of
Stout (1987), obtained by using the statistic T prior to the proposed corrections. That is, T
used $S_k$, did not correct for high-discriminating/guessing items, and did not
automatically select M. By contrast, $T_2$ used $S_k''$, corrected for
high-discriminating/guessing items, and automatically selected M.

The Unidimensional Simulation Study 

The unidimensional, three-parameter logistic model was used to simulate the test
data. In order for the simulated test data to reflect real data, item parameter estimates
were obtained from real data sets for five different tests: SATV, ACTM, ACTE, ASVAB
AS, and ASVAB AR.^11 The distributions of item parameters for these five tests are given in
Table 2 and show that the five tests differ not only in length but also in the distribution of
difficulty and discrimination parameters. For example, ACTE has the lowest mean and
standard deviation of item discrimination parameters, ASVAB AR has the highest mean
item discrimination, and ASVAB AS has the highest standard deviation of item discrimination.
For each test type, two examinee sample sizes J were studied: 750 and 2000. With the
sample size of 750, 250 examinees were used for factor analytic selection of assessment
items, while the remainder were used to compute the test statistic. With J = 2000, 500
examinees were used for the factor analysis and the remainder were used for computing the
statistic.


Table 2 


Binary item responses were generated as explained below. Examinee abilities were
randomly generated from the standard normal distribution. For each simulated examinee,
the probability, $P_i(\theta)$, of correctly answering each item was computed using the
three-parameter unidimensional logistic model. If a uniform random deviate in the interval
(0,1) was less than or equal to the computed probability $P_i(\theta)$, the examinee was
considered to have answered the item correctly and was given a score of 1; otherwise, the
examinee was given a score of 0.
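
The response-generation steps just described can be sketched in a few lines of Python. This is an illustration only, not the authors' program: the function names and the three item parameter triples are hypothetical, and the scaling constant 1.7 is assumed by analogy with the two-dimensional model in (15).

```python
import math
import random


def p_correct(theta, a, b, c):
    """3PL probability of a correct response (1.7 scaling is an assumption)."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))


def simulate_responses(items, n_examinees, rng):
    """items: list of (a, b, c) tuples. Returns one 0/1 row per examinee."""
    data = []
    for _ in range(n_examinees):
        theta = rng.gauss(0.0, 1.0)       # ability drawn from N(0, 1)
        row = []
        for a, b, c in items:
            u = rng.random()              # uniform deviate on [0, 1)
            row.append(1 if u <= p_correct(theta, a, b, c) else 0)
        data.append(row)
    return data


rng = random.Random(0)
items = [(1.0, 0.0, 0.2), (1.5, 0.5, 0.2), (0.7, -1.0, 0.15)]  # hypothetical values
responses = simulate_responses(items, 750, rng)
```

In an actual replication the `items` list would hold the estimated parameters of one of the five tests summarized in Table 2.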

For each combination of test type and examinee sample size, DIMTEST (as here modified
by the Wilcoxon rank sum test, automated M, and the alternate standard error of estimate
$S_k''$) was replicated 100 times, with new examinee responses being simulated each time.
The number of rejections out of 100 replications of testing the null hypothesis of essential
unidimensionality is reported in Table 3. Because the test data were generated from a
unidimensional model, the observed level of significance should be close to the nominal
level, which was set to .05.


Table 3 and Table 4


Table 3 shows the observed type-I error for all five simulated test types for
different sample sizes. Of particular interest is the second column: rejection rates for the
SATV high-discrimination items. Contrasting these results with the rejection rates of Table 1
shows that, with the proposed correction for excess bias (that is, the Wilcoxon rank sum
test), the rejection rates have dropped to an acceptable level. For example, the rejection
rate with 2000 examinees has dropped from 58 to 7. For other test types, the observed level
of significance is also close to the nominal level. Table 4 compares these results with those
of Stout (1987, Table 2), where the statistic T was used. The contents of Table 4 show that,
as a consequence of the proposed refinements, the observed type-I error rate has increased
or remained the same for all test types and sample sizes except for ASVAB AR with 2000
examinees. The overall average observed type-I error has increased from .023 (Stout, 1987)
to .045 and is very close to the nominal value of .05. In addition, for each one of the cell
entries, there is no statistical evidence to reject the hypothesis that the nominal level of
significance of .05 holds. That is, they are all consistent with a true rejection probability of p = .05.
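
The consistency claim can be illustrated with the normal approximation to a binomial count of rejections in 100 trials. The sketch below assumes a two-sided test at the 5% level (z = 1.96), which reproduces the acceptance region of roughly (.7, 9.3) trials given in the Notes; the function name is hypothetical, and the counts are the twelve cell entries of Table 3.

```python
import math


def acceptance_region(p=0.05, n_trials=100, z=1.96):
    """Normal-approximation acceptance region for a rejection count."""
    mean = n_trials * p
    se = math.sqrt(n_trials * p * (1.0 - p))
    return mean - z * se, mean + z * se, se


low, high, se = acceptance_region()

# Cell entries of Table 3 (J = 750 row, then J = 2000 row).
table3_counts = [6, 8, 5, 6, 2, 3, 6, 7, 4, 4, 2, 1]
all_inside = all(low < count < high for count in table3_counts)
```

Under these assumptions every Table 3 entry lies inside the acceptance region, in line with the statement that no cell gives evidence against the nominal level.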

The Two-Dimensional Simulation Study 

The two-dimensional simulation study was modeled according to the
multidimensional three-parameter logistic model with compensatory abilities (Reckase &
McKinley, 1983) given by:

$$P_i(\theta_1, \theta_2) = c_i + \frac{1 - c_i}{1 + \exp\{-1.7[a_{1i}(\theta_1 - b_{1i}) + a_{2i}(\theta_2 - b_{2i})]\}} \qquad (15)$$

Seven different test types were considered to study the power of the procedure after
the proposed changes. Two-dimensional counterparts of the five test types used in the
unidimensional simulation study were simulated in the following manner. The
discrimination parameters $(a_{1i}, a_{2i})$ of the two dimensions for each item were
independently generated from a normal distribution:

$$a_{1i} \sim N(\mu/2, \sigma/\sqrt{2}), \qquad a_{2i} \sim N(\mu/2, \sigma/\sqrt{2}),$$

where $\mu$ and $\sigma$ were the mean and the standard deviation of the distribution of
discrimination parameters of the respective unidimensional test taken from Table 2.
Likewise, $b_{1i}$ and $b_{2i}$ were assumed to be independent of each other for each item and were
generated:

$$b_{1i} \sim N(\mu, \sigma), \qquad b_{2i} \sim N(\mu, \sigma),$$

where $\mu$ and $\sigma$ were the mean and the standard deviation of the distribution of difficulty
parameters of the respective unidimensional test taken from Table 2. For example, to
generate the two-dimensional counterpart of the SATV test, the $a_{1i}$'s and $a_{2i}$'s were
generated independently from the normal distribution with mean 1.07/2 and standard
deviation $.4/\sqrt{2}$. Similarly, the $b_{1i}$'s and $b_{2i}$'s were generated independently from the
normal distribution with mean .58 and standard deviation .88. Each test was taken to
consist of $N_1$ "pure" items dependent on $\theta_1$ alone, $N_2$ "pure" items dependent on $\theta_2$ alone,
and $N_3$ mixed items dependent on both $\theta_1$ and $\theta_2$.

Abilities $\theta = (\theta_1, \theta_2)$ were generated from a bivariate normal distribution with
both means being zero and both variances being one. The correlation coefficient $\rho$ between
the abilities was varied across conditions. The c-parameter was taken to be .20 for all items.
Binary item responses were generated exactly as described for unidimensional tests using
(15).
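
The two-dimensional generation scheme can be sketched as follows. This is a hedged illustration rather than the original simulation code: all function names are hypothetical, a "pure" item is assumed to be obtained by setting the discrimination on the other dimension to zero, and no truncation is applied to the normal draws, since the text does not say how negative discriminations were handled.

```python
import math
import random


def p_correct_2d(theta1, theta2, a1, a2, b1, b2, c):
    """Compensatory two-dimensional 3PL probability, following (15)."""
    z = a1 * (theta1 - b1) + a2 * (theta2 - b2)
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * z))


def draw_abilities(rho, rng):
    """One draw from a standard bivariate normal with correlation rho."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    return z1, rho * z1 + math.sqrt(1.0 - rho * rho) * z2


def make_items(n1, n2, n3, mu_a, sd_a, mu_b, sd_b, c, rng):
    """Return (a1, a2, b1, b2, c) tuples for n1 + n2 pure and n3 mixed items."""
    items = []
    for kind in [1] * n1 + [2] * n2 + [3] * n3:
        a1 = rng.gauss(mu_a / 2.0, sd_a / math.sqrt(2.0))
        a2 = rng.gauss(mu_a / 2.0, sd_a / math.sqrt(2.0))
        b1 = rng.gauss(mu_b, sd_b)
        b2 = rng.gauss(mu_b, sd_b)
        if kind == 1:
            a2 = 0.0      # assumption: pure theta_1 item ignores theta_2
        elif kind == 2:
            a1 = 0.0      # assumption: pure theta_2 item ignores theta_1
        items.append((a1, a2, b1, b2, c))
    return items


def simulate(items, n_examinees, rho, rng):
    """Generate one 0/1 response row per examinee."""
    data = []
    for _ in range(n_examinees):
        t1, t2 = draw_abilities(rho, rng)
        data.append([1 if rng.random() <= p_correct_2d(t1, t2, *item) else 0
                     for item in items])
    return data


rng = random.Random(1)
# SATV counterpart: 17-17-16 split, Table 2 means and standard deviations.
items = make_items(17, 17, 16, 1.07, 0.40, 0.58, 0.88, 0.20, rng)
responses = simulate(items, 750, 0.5, rng)
```

The correlated abilities are produced from two independent standard normal draws, which is one standard way to obtain a bivariate normal pair with unit variances and correlation $\rho$.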

In addition to the five two-dimensional counterparts of unidimensional tests, two
more tests, the ACT Mathematics Usage Form 8B (ACTM8B) and the ACT Mathematics
Usage Form 24B (ACTM24B), were used. For these two tests, estimated two-dimensional
item parameters $(a_{1i}, a_{2i})$ and $(b_{1i}, b_{2i})$ were obtained from the American College Testing
Program. Except for item parameter generation, which has been replaced by use of actual
item parameter estimates, the responses for these two tests were simulated as described
above.

For each of the seven test types, two examinee sample sizes J were considered, 750
and 2000, and two levels of correlation $\rho$ were considered, .5 and .7. As in the
unidimensional study, when J = 750, 250 examinees were used for factor analysis, and, when
J = 2000, 500 examinees were used for factor analysis. For each combination of test type,
examinee sample size, and level of correlation, DIMTEST (as modified by the Wilcoxon
rank sum test, automated M, and the alternate standard error of estimate $S_k''$) was
replicated 100 times, each time simulating new examinees. For the first five test types, a
new set of item parameters was generated for each test after each 10 replications. The
number of rejections over 100 replications is reported in Table 5 for each case.


Table 5 and Table 6

In the case of $d_E = 2$, one wants good power; that is, one wants $P[T_2 > Z_\alpha]$ to be large
for a broad range of realistic $d_E = 2$ alternatives. The contents of Table 5 show that the
power is extremely high for the case of $\rho = .5$ for both sample sizes. The power is very high
for $\rho = .7$ with 2000 examinees, and the power is good for $\rho = .7$ with 750 examinees. These
results are noteworthy, considering that all tests in the simulation study consist of at least
one-third mixed items requiring knowledge of both traits to be answered correctly.
Furthermore, it can be seen that as the sample size increases, the power also increases.

Table 6 compares the results of the present study with the results of Stout's
simulation study, which used the statistic T (1987, Table 6). It can be seen that, as a
consequence of the proposed refinements, the power has increased for every test type,
sample size, and level of correlation. On the average, power has gone up from 67 to 88
rejections per 100 trials of the procedure for the case of $\rho = .5$ with 750 examinees, from 92
to 99 rejections for the case of $\rho = .5$ with 2000 examinees, from 36 to 54 for the case of
$\rho = .7$ with 750 examinees, and from 67 to 90 rejections for the case of $\rho = .7$ with 2000
examinees. These average increases are large enough to be of practical importance.
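
The averages quoted for the present study can be reproduced from the Table 5 entries for the five test types shared with Stout (1987). The snippet below is simple arithmetic; the dictionary layout is a hypothetical convenience, and the counts are copied from Table 5 in the order SATV, ACTM, ACTE, ASVAB AS, ASVAB AR.

```python
# Rejections per 100 trials in the present study, keyed by (rho, J).
present = {
    (0.5, 750): [93, 97, 81, 73, 94],
    (0.5, 2000): [100, 100, 100, 99, 98],
    (0.7, 750): [58, 66, 37, 50, 61],
    (0.7, 2000): [96, 97, 83, 83, 91],
}

averages = {key: sum(counts) / len(counts) for key, counts in present.items()}
```

Rounded to the nearest integer, these averages agree with the values 88, 99, 54, and 90 reported in the text.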

Real Data Study 

Four different data sets were used to examine the performance of DIMTEST on
actual data. Data for two Armed Services Vocational Aptitude Batteries, used by the
Department of Defense Student Testing Program in high schools and post-secondary
schools, were obtained from Linn, Hastings, Hu, and Ryan (1987). These tests included
Arithmetic Reasoning tests for Grades 10 and 12 (AR10 & AR12), each with 30 items and
1984 and 1961 examinees, respectively. Two more data sets were obtained from the American
College Testing (ACT) Program. These included ACT Mathematics Usage Forms B and C
(F29B & F29C), each with 40 items and 2491 and 2494 examinees, respectively.

DIMTEST was applied to each of the four data sets. In each data set, 500 examinees
were randomly selected for factor analysis; the rest were used for computing the statistic.
Examinees were randomly split into two groups, one group for performing factor analysis
and the other for computing the statistic, 100 times, each time testing the null
hypothesis of essential unidimensionality. The number of rejections over 100 replications of
the procedure was noted. The results for all tests are tabulated in Table 7.
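
The repeated random split of one real data set can be sketched as follows. The function name is hypothetical, and only examinee indices are handled; the factor analysis and the computation of the statistic are outside the scope of this illustration.

```python
import random


def split_examinees(n_examinees, n_factor, rng):
    """Randomly split examinee indices into a factor-analysis group and the rest."""
    indices = list(range(n_examinees))
    rng.shuffle(indices)
    return indices[:n_factor], indices[n_factor:]


rng = random.Random(0)
# 100 replications for a data set the size of AR10 (1984 examinees).
splits = [split_examinees(1984, 500, rng) for _ in range(100)]
```

Each replication reuses the same examinees in a different partition, so the 100 outcomes are variations on a single administration rather than independent samples.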


Table 7 


The contents of Table 7 suggest that, according to DIMTEST, AR10 and AR12
should be assessed as essentially unidimensional tests while F29B and F29C should be 
assessed as multidimensional tests. Examination of items of F29B and F29C showed that 
these tests consist of items assessing knowledge of arithmetic and algebra operations, 
geometry, numeration, story problems, and advanced topics. Therefore, from the 
perspective of content, F29B and F29C would seem to be multidimensional tests measuring 
highly correlated abilities. The rejection rate for AR12 is slightly higher than expected for 
an essentially unidimensional test. One or two items highly influenced by another factor 
may contribute to this high rejection rate, or many items may be slightly influenced by a 
second factor. Further investigation is necessary to examine possible reasons. 

Summary and Discussion 

Detailed investigation of DIMTEST for assessing unidimensionality revealed certain
limitations. It failed to perform as desired when the test consisted predominantly of
difficult, high-discrimination items with guessing present. This limitation was
overcome by a more appropriate selection of assessment items. Also, an automated
approach was devised to determine the size of assessment subtests, and the estimate of the
standard error of the statistic was adjusted to yield the desired level of significance for
$d_E = 1$ data and higher power for $d_E > 1$ data. After the proposed refinements were
implemented, DIMTEST was applied to a variety of simulated tests for different sample
sizes; these tests were modeled on Stout's (1987) simulation study.

Comparison of the results of the present study with the results of Stout's (1987)
study indicates that the proposed refinements have improved the observed level of
significance, which is now close to the nominal level for $d_E = 1$ simulations, and have considerably
increased the power for $d_E = 2$ simulations for different levels of correlation and sample
sizes. In addition, the procedure has been used on a number of real data sets. The results of
the real test data study seem to confirm the a priori hypotheses regarding the
dimensionality of these tests.^13

The refinements have led to a revised test procedure that is, in particular, more
robust against unusually high discrimination parameters with guessing present and that, in
general, is able to perform more desirably with respect to type-I and type-II errors.
Moreover, the procedure is automated and totally data-dependent in its selection of
assessment subtest items, making it more user friendly. The automation of the size of the
assessment subtests could especially benefit the novice user. Because the power of the
statistical test relies heavily upon appropriate selection of items for AT1, our simulation
study provides further evidence that the use of linear factor analysis for selection of these
items is a promising approach that requires little effort on the part of the user.

When the statistical test rejects the null hypothesis of essential unidimensionality,
it is possible to proceed in several ways. One approach would be to reexamine the test and
assess the complexity of the essential multidimensionality present using DIMTEST,
NOFA, and so forth. If inference suggests that each of the different dominant traits
influences a distinct group of items (i.e., there is a pronounced simple structure), the test
could be split into several essentially unidimensional subtests, and each one could be
analyzed separately using unidimensional IRT models. Alternatively, if most of the items of
the test are each influenced simultaneously by several dominant dimensions, then the
researcher may need to resort to multidimensional parametric models in order to make
inferences about the test data (Reckase, 1985, 1989).

The dimensionality of a set of item responses is conceptually very complex. It is a
function of items, examinees, and extraneous factors such as type of instruction and stage
of learning. Also, dimensionality is, from the practical perspective, a continuum. Because
items are multiply determined, among finite-length tests (the only kind available in
applications), there is no such thing as a strictly unidimensional test. But we can still
describe a given set of item responses as being well modeled by an essentially
unidimensional test model. Junker (1990, 1991) argues that an index for the continuum of
dimensionality should be developed with strict unidimensionality, in the sense of fitting
local independence models, on one end and strict essential multidimensionality on the other
end, with essential unidimensionality in between. Junker and Stout (1991) have developed
indices for lack of essential unidimensionality, which can be extremely useful for assessing
the degree of lack of essential unidimensionality when Stout's test of $d_E = 1$ is rejected.
Additionally, these indices show when it is safe to use unidimensional estimation
procedures such as LOGIST or BILOG to arrive at accurate ability estimates. The
conjecture is that lack of strict unidimensionality is not detrimental, provided $d_E = 1$
modeling provides a good approximation to reality. The number of items influenced by the
secondary dimensions, as well as the strength of the influence of secondary dimensions on
each item, should determine how severe the departure from $d_E = 1$ is. Nandakumar (1991) has
demonstrated the utility of DIMTEST in assessing essential unidimensionality when test
items were influenced by various dimensions to various degrees, and thus strict
dimensionality exceeded one. Nandakumar has found that the accuracy of the
approximation of essential unidimensionality for a test is a function of the proportion of
test items influenced by the various nondominant traits present and of the strength of the
influence of these traits.

Stout's procedure seems very promising for assessing the dimensionality underlying
a set of items. It is an outgrowth of the conceptual definition of essential unidimensionality
and was developed to be sensitive to dominant dimensions and insensitive to transient or
minor dimensions. The procedure is nonparametric (thus avoiding parametric model-data
fit problems), is supported by an asymptotic theory, and is computationally simple.
However, the procedure is relatively new, and its applicability in a variety of realistic
applications needs to be studied further. Software to run DIMTEST is available from the
authors.

Appendix

Algorithm 1: Test for Difficulty Factor

1. Rank the N items from most difficult (rank 1) to easiest (rank N).

2. Compute the sum $W_s$ of the ranks of the M items in AT1.

3. Compute the mean $E(W_s)$ and the standard deviation $SD(W_s)$ of the
   sum $W_s$ under the assumption of randomly distributed ranks:

   $E(W_s) = \frac{1}{2} M(N+1)$

   $SD(W_s) = \left( \frac{1}{12} M(N-M)(N+1) \right)^{1/2}$.

4. Compute the critical value C for $W_s$ under the usual large sample
   approximation:

   $C = E(W_s) + Z_\alpha \, SD(W_s)$,

   where $Z_\alpha$ is the upper $100(1-\alpha)$th percentile of the standard normal
   distribution and $\alpha$ is the desired level of significance.

5. If $W_s > C$, conclude that the M items in AT1 are too easy.
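
Algorithm 1 can be sketched in Python as follows. This is an illustrative reading of the steps, not the DIMTEST source: the function name and its arguments are hypothetical, item difficulty is assumed to be supplied as a number that is larger for harder items, ties in difficulty are not treated specially, and the critical value uses the standard upper-tail normal approximation $E(W_s) + Z_\alpha SD(W_s)$.

```python
import math
from statistics import NormalDist


def at1_too_easy(difficulties, at1_indices, alpha=0.05):
    """Wilcoxon rank sum check of whether the AT1 items are too easy.

    difficulties: one value per item, larger meaning harder (assumption).
    at1_indices: positions of the M items selected for AT1.
    Returns (rank_sum, critical_value, decision).
    """
    n = len(difficulties)
    m = len(at1_indices)
    # Step 1: rank 1 is the most difficult item, rank N the easiest.
    order = sorted(range(n), key=lambda i: difficulties[i], reverse=True)
    rank = {item: position for position, item in enumerate(order, start=1)}
    # Step 2: rank sum of the AT1 items.
    w = sum(rank[i] for i in at1_indices)
    # Step 3: moments under randomly distributed ranks.
    mean = m * (n + 1) / 2.0
    sd = math.sqrt(m * (n - m) * (n + 1) / 12.0)
    # Step 4: large-sample critical value.
    critical = mean + NormalDist().inv_cdf(1.0 - alpha) * sd
    # Step 5: a large rank sum means AT1 is dominated by easy items.
    return w, critical, w > critical


difficulties = list(range(20, 0, -1))     # hypothetical: item 0 hardest, item 19 easiest
easy_result = at1_too_easy(difficulties, [16, 17, 18, 19])
hard_result = at1_too_easy(difficulties, [0, 1, 2, 3])
```

In this toy example the four easiest items produce a rank sum above the critical value, while the four hardest items do not.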
Algorithm 2: The Size M of Assessment Subtests


Let N = total number of items, Mlow = 4, Mhigh = N/4, and Maxload = .15.

1. Compute

   a) $I_1$ = Number of positive loadings > Maxload.

   b) $I_2$ = Number of negative loadings < -Maxload.

2. Redefine

   $I_1$ := min(Mhigh, $I_1$)

   $I_2$ := min(Mhigh, $I_2$).

3. If both $I_1$ < Mlow and $I_2$ < Mlow, then define

   Maxload := Maxload - .05.

   Go to Step 1.

4. If only one of $I_1$ or $I_2$ is ≥ Mlow, then let

   M = $I_1$ if $I_1$ ≥ Mlow
   M = $I_2$ if $I_2$ ≥ Mlow

5. If both $I_1$ ≥ Mlow and $I_2$ ≥ Mlow, then compute the averages Avg1
   and Avg2 of item loadings for the sets corresponding to $I_1$ and
   $I_2$, respectively. Let

   M = $I_1$ if Avg1 > Avg2
   M = $I_2$ if Avg2 > Avg1
   M = Max($I_1$, $I_2$) if Avg1 = Avg2
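
A possible implementation of Algorithm 2 is sketched below. Several details are assumptions rather than statements from the source: Mhigh is taken as the integer part of N/4, the thresholds in Steps 4 and 5 are read as "at least Mlow", the sets truncated by Mhigh keep the loadings of largest magnitude, the averages in Step 5 compare absolute loadings, and the search stops with an error once Maxload has reached zero. The function name is hypothetical.

```python
def choose_m(loadings, m_low=4, max_load=0.15, step=0.05):
    """Pick the assessment subtest size M from second-factor loadings."""
    n = len(loadings)
    m_high = n // 4                       # assumption: Mhigh = floor(N / 4)
    while True:
        # Step 1: count loadings beyond the current threshold.
        positive = [x for x in loadings if x > max_load]
        negative = [x for x in loadings if x < -max_load]
        # Step 2: cap both counts at Mhigh.
        i1 = min(m_high, len(positive))
        i2 = min(m_high, len(negative))
        if i1 >= m_low or i2 >= m_low:
            break
        if max_load <= 0.0:
            raise ValueError("no set of at least m_low items could be formed")
        # Step 3: relax the threshold and repeat (rounding avoids float drift).
        max_load = round(max_load - step, 10)

    if i1 >= m_low and i2 >= m_low:
        # Step 5: compare average loading magnitude of the two candidate sets.
        top_positive = sorted(positive, reverse=True)[:i1]
        top_negative = sorted(negative)[:i2]
        avg1 = sum(top_positive) / i1
        avg2 = sum(abs(x) for x in top_negative) / i2
        if avg1 > avg2:
            return i1
        if avg2 > avg1:
            return i2
        return max(i1, i2)
    # Step 4: exactly one side reached Mlow.
    return i1 if i1 >= m_low else i2


# Hypothetical loadings for a 20-item test.
example_a = [0.5, 0.4, 0.3, 0.3, 0.2, 0.2] + [0.0] * 11 + [-0.2, -0.3, -0.1]
example_b = [0.12] * 4 + [0.0] * 16
m_a = choose_m(example_a)
m_b = choose_m(example_b)
```

In the first example six positive loadings exceed .15 and the count is capped at Mhigh = 5; in the second, no loading exceeds .15, so the threshold is lowered to .10 before four items qualify.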
Notes

1 Throughout, we speak of a unidimensional test, a unidimensional set of items, etc. This
convenient phrasing represents the more complex reality that the dimensionality of a model
or a data set rests on the joint influence of test items and examinee population. Items and
examinees together produce test data that we judge by statistical inference to be
unidimensional or not. Reckase (1990) writes perceptively on this point. Technically, IRT
dimensionality is usually defined to be the lowest latent space dimension possible, such
that monotonicity and local independence hold.

2 Note that the statistic $T_L$ computed from AT1 is sensitive to dimensionality (that is, it
can discriminate between $d_E = 1$ vs. $d_E > 1$) and to sources of bias. The idea in introducing
AT2 is to deliberately make $T_B$ sensitive only to sources of bias but not to dimensionality.

3 In unidimensional settings where the procedure worked well, typical values of $T_L$ ranged
roughly from 1 to 5, and typical values of $T_B$ ranged roughly from .6 to 4.0; thus
typical values of T ranged roughly from -1.0 to 1.5.

4 If a randomized block design with M blocks of size 2 is to be used in an experiment with
human subjects assigned to control and treatment groups, this experimental design
technique will work well unless the subjects are too variable. By rough analogy, the higher
the discrimination parameters, the more "variable" are the items that are being assigned to
AT1 and AT2 and the less effective the difficulty matching method (analogous to blocking)
of AT2 item selection is in eliminating bias.

5 It can be observed that when items in AT1 are replaced (because they are too easy) with
items of high loadings of the opposite sign, easy PT items could result, thereby causing
inaccuracy in subgroup assignment of high-scoring examinees. Simulation results have
shown that this potential inaccuracy is not as detrimental to the value of the statistic T as
it was when PT had mostly difficult items.

6 We also tried to correct tetrachoric correlations for guessing by following Bock, Gibbons,
and Muraki (1985) and by using nonlinear factor analyses to diminish the influence of
difficulty on the second factor loadings. Regarding correction of tetrachorics, we found that
when guessing values were about .2 in the model, a large percentage of the sample
correlations was computed as 1 or -1. However, when the guessing levels were arbitrarily
cut by half, the problem of extreme correlations was reduced. Even with this reduction of
guessing levels, the items selected for AT1 did not differ significantly from those selected
without correction for guessing. Moreover, the ad hoc method of cutting guessing levels
defeats the purpose of using the three-parameter logistic model. Therefore, correction for
guessing was not implemented.

The nonlinear factor extraction program NOFA was used to select items for the
assessment subtests. We tried a two-factor quadratic model for this purpose. In comparing
the results of linear and nonlinear factor analysis, we found no difference in T-values
between the two methods. To our surprise, the difficulty factor reappeared even with
nonlinear factor analysis. Therefore, we did not implement nonlinear factor analysis.

7 The reason for the word "suggests" instead of "establishes" is that Stout's result actually
assumes unidimensionality under the stronger assumption of local independence. Further,
the asymptotic invariance in (6) also relies on the stronger assumption of local
independence.

8 $(S_k')^2$ is an asymptotic variance and fails to account for the overdispersion that
occurs as examinees in a fixed PT subgroup have varying abilities, even though the test is
unidimensional. Thus, $S_k'$ will underestimate the true standard error and will yield too
large a type-I error (see Cox & Snell, 1989, pp. 106-110, for a nice discussion of
overdispersion resulting from a varying parameter such as ability).

9 There are, of course, many more possibilities for computing statistics with given weights
and standard errors of estimate, but those described here were considered the most
appropriate.

10 Technically, our simulations were done with d = 1, implying $d_E = 1$. For simulation studies
for which $d_E = 1$, see Nandakumar (1991).

11 SATV denotes the SAT verbal test, obtained from Lord (1968); ACTM denotes the
ACT Mathematics Usage test and ACTE denotes the ACT English Usage test, both
obtained from Drasgow (1987); ASVAB AS and ASVAB AR denote the Armed Services
Vocational Aptitude Battery Auto Shop Information and Arithmetic Reasoning tests,
respectively, both obtained from Mislevy and Bock (1984).

12 The standard error for testing the hypothesis of p = .05 vs. p ≠ .05 is approximately 2.2
trials. Thus, the acceptance region of this test for a set of 100 simulations is given by (.7,
9.3) trials.

13 We say "seem to" because one cannot really know whether a real data set is $d_E = 1$ or $d_E > 1$.
Further, the 100 replications of Table 7 are not the result of 100 administrations of the test
to similar examinee populations, but rather 100 variations of the application of the statistic
to one data set that resulted from one administration of the test.
Table 1

Rejection rates per 100 trials for $d_E = 1$ simulation study using
estimated item parameters of SAT verbal test with $\alpha = 0.05$

Discrimination             Number of       Number of examinees
parameter                  items           750     1000    2000

0 < a_i < 1.0              41                4        0       3
(low-discriminations)

1.1 < a_i < 2.0            39               28       46      58
(high-discriminations)









Table 2 

Sample distributions of item parameters for the five
standardized tests used in the study

                 SATV     ACTM     ACTE     ASVAB AS   ASVAB AR

N                  80       40       75        25         30

Max a_i's        2.00     2.00     1.58      2.82       2.76
Min a_i's        0.40     0.40     0.11      0.32       0.50
Mean a_i's       1.07     1.09     0.72      1.22       1.46
S.D. a_i's       0.40     0.35     0.25      0.70       0.51

Max b_i's        2.50     1.50     2.07      1.27       1.01
Min b_i's       -1.50    -1.02    -3.11     -1.39      -2.72
Mean b_i's       0.58     0.50     0.03      0.09      -0.02
S.D. b_i's       0.88     0.61     0.96      0.72       0.84

Max c_i's        0.20     0.21     0.27      0.26       0.34
Min c_i's        0.04     0.02     0.04      0.06       0.08
Mean c_i's       0.16     0.14     0.15      0.20       0.19
S.D. c_i's       0.05     0.04     0.03      0.04       0.06

N denotes the test length.
SATV denotes the SAT verbal test battery.
ACTM denotes the ACT mathematics usage test battery.
ACTE denotes the ACT English usage test battery.
ASVAB AS denotes the Armed Services Vocational
Aptitude Battery for auto shop information.
ASVAB AR denotes the Armed Services Vocational
Aptitude Battery for arithmetic reasoning.








Table 3 

Results of unidimensional simulation study: Rejection rates for testing
the null hypothesis of $d_E = 1$ over 100 trials with c = .20 and $\alpha = .05$

J        SATV*    SATV        ACTM    ACTE*    ASVAB AS    ASVAB AR
                  high dis

750        6        8           5       6         2           3
2000       6        7           4       4         2           1

* SATV and ACTE each contain more than 50 items in the pool, but 50 items
were randomly selected for the study. After each 10 of 100 trials a new
sample of 50 items was chosen. For other tests the same test was used for
all 100 trials.







Table 4 

Comparison of unidimensional simulation study results of this paper with
those in Stout (1987): Rejection rates for testing the null hypothesis of
$d_E = 1$ over 100 trials with c = .2 and $\alpha = .05$

                   SATV         ACTM         ACTE       ASVAB AS     ASVAB AR
Study            750  2000    750  2000    750  2000    750  2000    750  2000

Stout* (1987)      2     6      1     4      3     1      1     1      2     4
Present            6     6      5     4      6     4      2     2      3     1

* For all tests the rejection rate reported is the average of rejection
rates (rounded to nearest integer) for the two different M values
reported in Table 2 of Stout (1987).





Table 5 

Results of two-dimensional simulation study: Rejection rates for testing
the null hypothesis of $d_E = 1$ over 100 trials with c = .20 and $\alpha = .05$

              SATV        ACTM        ACTE       ASVAB AS    ASVAB AR    ACTM24B   ACTM8B
N1-N2-N3    17-17-16    13-13-14    17-17-16      8-8-9      10-10-10    0-0-40    0-0-50

J           750  2000   750  2000   750  2000   750  2000   750  2000     2000      2000

ρ = .5       93   100    97   100    81   100    73    99    94    98       99       100
ρ = .7       58    96    66    97    37    83    50    83    61    91       69        98












Table 6 

Comparison of two-dimensional simulation study results of this paper with
those in Stout (1987): Rejection rates over 100 trials for testing the
null hypothesis $d_E = 1$ with c = .2 and $\alpha = .05$

                            SATV        ACTM        ACTE       ASVAB AS    ASVAB AR
N1-N2-N3                  17-17-16    13-13-14    17-17-16      8-8-9      10-10-10

J                         750  2000   750  2000   750  2000   750  2000   750  2000

ρ = .5   Stout* (1987)     62    98    69     -    59    90     -    87    76     -
         Present           93   100    97   100    81   100    73    99    94    98

ρ = .7   Stout* (1987)     36    83     -    74     -    55     -    54     -    67
         Present           58    96    66    97    37    83    50    83    61    91

* For all tests the rejection rate reported is the average of rejection
rates (rounded to nearest integer) for the two different M values reported
in Table 6 of Stout (1987).






Table 7 

Results of real data study: Rejection rates for testing
the null hypothesis of $d_E = 1$ over 100 replications of random
selection of subjects with $\alpha = .05$

               AR10     AR12     F29B     F29C

N:               30       30       40       40
J:             1984     1961     2491     2494

Rejections        6       13       86       82